An Empirical Investigation of the Impact of Discretization on Common Data Distributions

Authors

  • Michael Ismail
  • Victor Ciesielski
Abstract

This study compares six of the most popular discretization methods on a randomly generated dataset whose attributes each conform to one of eight common statistical distributions. The analysis is intended to yield a heuristic for choosing the most appropriate discretization method, given some preliminary analysis or visualization to determine the statistical distribution of the attribute to be discretized. A second primary focus is how effective discretization is, in comparative terms, for each data distribution. The data were analyzed by inducing a decision tree classifier (C4.5) on the discretized data, and an error measure was used to determine the relative value of discretization. The experiments showed that both the discretization method and the level of inherent error placed in the class attribute have a major impact on the classification errors generated after discretization. More importantly, the general effectiveness of discretization varies significantly with the shape of the data distribution. Distributions that are highly skewed or have high peaks tend to produce higher classification errors, and the advantage of supervised over unsupervised discretization diminishes significantly on these distributions.
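To make the effect of distribution shape concrete, here is a minimal sketch of two unsupervised discretization schemes, equal-width and equal-frequency binning, applied to a skewed attribute. The abstract does not name the six methods studied, so the choice of these two schemes, the exponential distribution, the sample size, and the bin count are all assumptions made for illustration, and the function names are hypothetical.

```python
import random

def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant attribute
    # The maximum value would land in bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts.

    Bins are assigned by rank, so tied values may be split across bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def bin_counts(bins, k):
    """Count how many values fell into each of the k bins."""
    return [bins.count(b) for b in range(k)]

# Illustrative data only: a highly skewed (exponential) attribute.
rng = random.Random(0)
skewed = [rng.expovariate(1.0) for _ in range(1000)]
k = 5

width_counts = bin_counts(equal_width_bins(skewed, k), k)
freq_counts = bin_counts(equal_frequency_bins(skewed, k), k)
print("equal width:    ", width_counts)
print("equal frequency:", freq_counts)
```

On skewed data, equal-width binning concentrates most values in the first interval, while equal-frequency binning keeps the bins balanced. This is one plausible reason a discretization method could behave differently across distribution shapes; it is not a reproduction of the paper's experiments.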


Publication date: 2003